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Abstract 

We study the problem of determining the optimal low dimensional projection for maximising 
the separability of a binary partition of an unlabelled dataset, as measured by spectral graph the¬ 
ory. This is achieved by finding projections which minimise the second eigenvalue of the Laplacian 
matrices of the projected data, which corresponds to a non-convex, non-smooth optimisation prob¬ 
lem. We show that the optimal univariate projection based on spectral connectivity converges to 
the vector normal to the maximum margin hyperplane through the data, as the scaling parameter 
is reduced to zero. This establishes a connection between connectivity as measured by spectral 
graph theory and maximal Euclidean separation. It also allows us to apply our methodology to 
the problem of finding large margin linear separators. The computational cost associated with 
each eigen-problem is quadratic in the number of data. To mitigate this problem, we propose 
an approximation method using microclusters with provable approximation error bounds. We 
evaluate the performance of the proposed method on a large collection of benchmark datasets 
and find that it compares favourably with existing methods for projection pursuit and dimension 
reduction for unsupervised data partitioning. 
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1 Introduction 


The classification of unlabelled data is fundamental to many statistical and machine learning appli¬ 
cations. Such applications arise in the context of clustering and semi-supervised classification. Un¬ 
derpinning these tasks is the assumption of a clusterable structure within the data, and importantly 
that this structure is relevant to the classification task. The assumption of a clusterable structure, 
however, begs the question of how a cluster should be defined. Centroid based methods, such as the 
ubiquitous /c-means algorithm, define clusters in reference to single points, or centers (Leisch, 2006[). 


In the non-parametric statistical approach to clustering, clusters are associated with the modes of a 


probability density function from which the data are assumed to arise (Hartigan 1975, Chapter 11). 


We consider the definition as given in the context of graph partitioning, and the relaxation given 
by spectral clustering. Spectral clustering has gained considerable interest in recent years due to 
its strong performance in diverse application areas. In this context clusters are defined as strongly 
connected components of a graph defined over the data, wherein vertices correspond to data points 


and edge weights represent pairwise similarities (von Luxburg, 20071. 


The minimum cut graph problem seeks to partition a graph such that the sum of the edges 
connecting different components of the partition is minimised. To avoid partitions containing small 
sets of vertices, a normalisation is introduced which helps to emphasise more balanced partitions. 
The normalisation, however, makes the problem NP-hard (Wagner and Wagner 19931, and so a 


continuous relaxation is solved instead. The relaxed problem, known as spectral clustering, is solved 
by the eigenvectors of the graph Laplacian matrices. We give a brief introduction to spectral clusering 
in Section |3] 
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Crucial to all cluster definitions is the relevance of spatial similarity of points. In multivariate data 
analysis, however, the presence of irrelevant or noisy features can significantly obscure the spatial 
structure in a data set. Moreover, in very high dimensional applications the curse of dimensionality 


can make spatial similarities unreliable for distinguishing clusters (Steinbach et al. 2004 Beyer et al. 


19991. Dimension reduction techniques seek to mitigate the effect of irrelevant features and of the 


curse of dimensionality by finding low dimensional representations of a set of data which retain as 
much information as possible. Most commonly these low dimensional representations are defined by 
the projection of the data into a linear subspace. Information retention is crucial for the success of any 
subsequent tasks. For unsupervised classification this information must, therefore, be relevant in the 
context of cluster structure. Classical dimension reduction techniques such as principal component 
analysis (PCA) cannot guarantee the structural relevance of the low dimensional subspace. Moreover 
a single subspace may not suffice to distinguish all clusters, which may have their structures defined 
within differing subspaces. Recently a number of dimension reduction methods with an explicit 


objective which is relevant to cluster structure have been proposed (Krause and Liebscher 
et ah, 2011 Pavlidis et al. 20161. We discuss these briefly in Section 


2005 Niu 


We consider the problem of learning the optimal subspace for the purpose of data bi-partitioning, 
where optimality is measured by the connectivity of the projected data, as defined in spectral graph 
theory. We formulate the problem in the context of projection pursuit; a class of optimisation problems 
which aim to find interesting subspaces within potentially high dimensional data sets, where interest¬ 
ingness is captured by a predefined objective, called the projection index. With very few exceptions, 
the optimisation of the projection index does not admit a closed form solution, and is instead numer¬ 
ically optimised. The projection indices considered in the proposed method are the second smallest 
eigenvalues of the graph Laplacians, which measure the quality of a binary partition arising from the 
normalised minimum cut graph problem. These eigenvalues are non-smooth and non-convex, and so 
specialised techniques are required to optimise them. We establish conditions under which they are 
Lipschitz and almost everywhere continuously differentiable, and discuss how to find local optima 
with guaranteed convergence properties. 

In this paper we establish an asymptotic connection between optimal univariate subspaces for bi¬ 
partitioning based on spectral graph theory, and maximum margin hyperplanes. Formally, we show 
that as the scaling parameter defining pairwise similarities is reduced to zero, the optimal univariate 
subspace for bi-partitioning converges to the subspace normal to the largest margin hyperplane through 
the data. This establishes a theoretical connection between connectivity as measured by spectral graph 
theory and maximal Euclidean separation. It also provides an alternative methodology for learning 


maximum margin clustering models, which have attracted considerable interest in recent years (Xu 


et al., 2004 Zhang et al. 20091. We introduce a way of modifying the similarity function which 


avoids focusing on outliers, and allows us to further control the balance of the induced partition. The 


importance of controlling this balance has been observed in the context of large margin clustering (Xu 


et al. 

2004 

Zhang et al. 

2009 

1 and low density separators ( 

Pavlidis et al. 

2016 


The computation cost associated with the eigen-problem underlying our projection index is quadratic 
in the number of data. To mitigate this computational burden we propose a data preprocessing step 
using micro-clusters which significantly speeds up the optimisation. We establish theoretical error 
bounds for this approximation method, and provide a sensitivity study which shows no degradation 
in clustering performance, even for a coarse approximation. 

The remainder of the paper is organised as follows. In Section we briefly discuss related work 
on dimension reduction for unsupervised data partitioning. A brief outline of spectral clustering is 
provided in Section Section presents the methodology for finding optimal projections to per¬ 
form binary partitions. Section describes the theoretical connection between optimal subspaces for 
spectral bi-partitioning and maximum margin hyperplanes. In Sectionwe discuss an approximation 
method in which the computational speed associated with finding the optimal subspace can be sig¬ 
nificantly improved, with provable approximation error bounds. Experimental results and sensitivity 
analyses are presented in Sectionwhile Section]^ is devoted to concluding remarks. 
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2 Related Work 


The literature on clustering high dimensional data is vast, and we will focus only on methods with 
an explicit dimension reduction formulation, as in projection pursuit. Implicit dimension reduction 
methods based on learning sparse covariance matrices (which impose an implicit low dimensional 
projection of the data/clusters), such as quadratic discriminant analysis, can be limited by the as¬ 
sumption that clusters are determined by their covariance matrices. Projection pursuit approaches 
can be made more versatile by defining objectives which admit more general cluster definitions. 

Principal component analysis and independent component analysis have been used in the context 
of clustering, however their objectives do not correspond exactly with those of the clustering task and 
the justification of their use is based more on common-sense reasoning. Nonetheless, these methods 


have shown good empirical performance on a number of applications (Boley, 1998 Tasoulis et al. 


2010 Kriegel et ah, 20091. Some recent approaches to projection pursuit for clustering rely on the 


non-parametric statistical notion clusters, i.e., that clusters are regions of high density in a probability 


distribution from which the data are assumed to have arisen. 

Krause and Liebscher 

(2005 

) proposed 

using as projection index the dip statistic ( 

Hartigan and Hartigan 

1985 

) of the projected data. The 


dip is a measure of departure from unimodality, and so maximising the dip tends to projections which 
have mutlimodal marginal density, and therefore separate high density clusters. The authors establish 
that the dip is differentiable for any projection vector onto which the projected data are unique, and 
use a simple gradient ascent method to find local optima. 

The minimum density hyperplane approach (Pavlidis et al., 2016) is posed as a projection pursuit 
for the univariate subspace normal to the hyperplane with minimal integrated density along it, thereby 
establishing regions of low density which separate the modes of the underlying probability density. 
The projection index in this case is the minimum of the kernel density estimate of the projected 
data, penalised to avoid hyperplanes which do not usefully split the data. The authors show an 
asymptotic connection between the hyperplane with minimal integrated density and the maximum 
margin hyperplane. The result we show in Section therefore establishes that the optimal subspace 
for bi-partitioning based on spectral connectivity is asymptotically connected with the minimum 
integrated density hyperplane. 


A number of direct approaches to maximum margin clustering have also been proposed (Xu et al. 


2004 Zhang et al. 2009). These can be viewed as a projection pursuit for the subspace normal 
to the maximum margin hyperplane intersecting the data. The iterative support vector regression 


approach (Zhang et al. 2009) uses support vector methods and so for the linear kernel explicitly 


learns the corresponding projection vector, v. 


Most similar to our work is that of Niu et al. (2011), who also proposed a method for dimension 
reduction based on spectral clustering. The authors show an interesting connection between optimal 
subspaces for spectral clustering and sujficient dimension reduction. For the case of a binary partition, 
their objective is equivalent to one of the objectives we consider, i.e., that of minimising the second 
smallest eigenvalue of the normalised Laplacian (cf. Sections and |^ . However, our methodology 
differs substantially from theirs. Niu et ah] (2011[) define their objective by 


max 

u,w 

tra.ce{U^ D-^/'^AD-^^'^U) 

(la) 

s.t. 

II 

h 

(lb) 


Aij = s{\\W^Xi - W^XjW) 

(Ic) 


II 

h 

(Id) 


The matrix A is the affinity matrix containing pairwise similarities of points projected into the subspace 
defined by W, and D is the diagonal degree matrix of A, with i-th diagonal element equal to the i- 
th row sum of A. Further details of these objects can be found in Section The approach used 
by the authors to maximise this objective alternates between using spectral clustering to determine 
the columns of U, and then using a gradient ascent method to maximise trace(C/^Il“^/^Ai4“^/^t7) 
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over W, where the dependence of this objective on the projection matrix W is through Eq. 
Within this gradient ascent step the matrices U and D are kept fixed. This process is iterated until 
convergence. However, the authors do not address the fact that the matrix D is determined by A, 
and therefore depends on the projection matrix W. An ascent direction for the objective assuming 
a fixed D is therefore not necessarily an ascent direction for the overall objective. Despite this fact 
the method has shown good empirical performance on a number of problems (Niu et al.| |2011| . In 


Section]^ we derive expressions for the gradient of the overall objective, which allows us to optimise 
it directly. 


3 Background on Spectral Clustering 


In this section we provide a brief introduction to spectral clustering, with particular attention to 
bi-partitioning, which underlies the focus of this work. Bi-partitioning using spectral clustering has 
been considered previously by Shi and Malik ( |2000 ), where a full clustering can be obtained by 
recursively inducing bi-partitions of (subsets of) the data. With a data sample, X = {xi,..., xat}, 
spectral clustering associates a graph G = {V,E), in which vertices correspond to observations, and 
the undirected edges assume weights equal to the pairwise similarity between observations. Pairwise 
similarities can be determined in a number of ways, including nearest neighbours and similarity 
metrics. In general, similarities are determined by the spatial relationships between points, and pairs 
which are closer are assigned higher similarity than those which are more distant. 


The information in G can be represented by the adjacency, or affinity matrix, A G 


IxN 


with 


Aij = Eij. The degree of each vertex Vi is defined as, di = ■ The degree matrix, D, is then 

defined as the diagonal matrix with Tth diagonal element equal to di. For a subset G C X, the size 
of C can be defined either by the cardinality of G, \G\, or by the volume of G, vol(C) = J2i-x eC 


Definition The normalised min-cut graph problem for a binary partition is defined as the optimisation 
problem 


min 

Ccx 


i,j:Xi^C ^Xj^X\C 


( — ^ — 
\size(C') 


size(Ai \ G) 


( 2 ) 


It has been shown ( ]Hagen and Kahiig 1992 Shi and Malik[ 2000| that the two normalised min-cut 
graph problems (corresponding to the two definitions of size) can be formulated in terms of the graph 
Laplacian matrices. 


(standard) L = D — A, 
(normalised) Tnorm = 


as follows. For G C X define to be the vector with z-th entry, 

c _ f -\/size7^\C^y7size(^, if Xi G C 
\ — -\/size~(C)7size()AYC), if Xi G X \ G. 

For size(C') = |C|, the optimisation problem in ([^ can be written as, 

min (u^)^Lu^ s.t. u^ ± 1 , ||u‘^|| = ViV. 

Similarly, if size(C') = vol(C) the problem in Q is equivalent to, 

min (■u‘")^L'u‘" s.t. Du‘^ ± 1 , (u^)^£>u^ = vol(V). 


( 3 ) 

( 4 ) 

( 5 ) 

( 6 ) 

( 7 ) 
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second eigenvector of L and the second eigenvector of the generalised eigen equation Lu = XDu re¬ 
spectively, the latter thus equivalently solved by where u is the second eigenvector of Lnorm- 


In particular, we have 





X2{L) < 


(8) 


'^2('^norni) 

1 / N\Tr N 

vol(E)^ ^ ’ 

(9) 


where \ 2 {L) and A 2 (inorm) are the second eigenvalues of L and Lnorm and and are the solutions 
to (|^ and Q respectively. 

The following properties of the matrices L and Lnorm will be useful in establishing our proposed 


methodology and the associated theoretical results. These properties can be found in (von Luxburg 


2007, Propositions 2 and 3). 

1. For any v G we have 


Lv = 


Vi V-i 


V Aij j 




y/~^i \J dj 


( 10 ) 

( 11 ) 


2. L and Tnorm are symmetric and positive semi-definite. 

3. The smallest eigenvalue of L is 0 with corresponding eigenvector 1 , the constant 1 vector 

4. The smallest eigenvalue of Tnorm is 0 with corresponding eigenvector 

The extension of clustering via the normalised min-cut to a iF-partition of the data is similar, and 
can be solved approximately by the first K eigenvectors of either L or Tnorm (von Luxburg 20071. 


4 Projection Pursuit for Spectral Connectivity 


In this section we study the problem of minimising the second eigenvalue of the graph Laplacian 
matrices of the projected data. If the projected data are split in two through spectral clustering, then 
the projection that minimises the second eigenvalue of the corresponding graph Laplacian minimises 
the connectivity of the two components, as measured by spectral graph theory. Note that while 
we discuss explicitly the minimisation of the second eigenvalue, the methodology we present in fact 
applies to an arbitrary eigenvalue of the graph Laplacians. As a result, the method discussed herein 
trivially extends to the problem of determining a AT-partition by minimising the sum of the K smallest 
eigenvalues of the Laplacians. 

To begin with, let X = {xi, ...,xn} be a d-dimensional data set and let V G be a projection 
matrix, where / S N is the dimension of the projection, and the columns of V, {Vi,...,V/}, have 
unit norm. With this formulation it is convenient to consider a parameterisation of V through polar 
coorindates as follows. Let 0 = [0, and for 0 G 0, the projection matrix V(0) G is 

given by. 




cos(0ij) nfe=\ sm(0kj), i = 1,..., d - 1 
nfc=lsin(0fcj), i = d. 


( 12 ) 


From this we dehne the I dimensional projected data set by P{0) = {p{0)i, ...,p{0)m} = {V{0)^ xi ,..., 
V{0Yxn}, and we let L{0) (resp. Lnorm(^)) be the Laplacian (resp. normalised Laplacian) of the 
graph of P{0). Edge weights are determined by a positive function s : {R})^ x {1... N}'^ -G K+, in 
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that the affinity matrix is given by A{0)ij := s{P{0),i, j)- In the simplest case we may imagine s 
being fully determined by the Euclidean distance between two elements of the projected data, i.e., 
s{P{d),i,j) = k{\\p{d)i — p{d)j\\), for some function A: : K —>■ K+. However we prefer to allow for a 
more general definition. We discuss this further in Section [E3| 

Henceforth we will use Ai(-) to be the i-th (smallest) eigenvalue of its (in all cases herein) real 
symmetric matrix argument. The objectives X2{L{0)) and A 2 (Tnorm(^)) are, in general, non-convex 
and non-smooth in 6, and so specialised techniques are required to optimise them. In the following 
subsections we investigate their differentiability properties, and discuss how alternating between a 
naive gradient descent method and a descent step based on a directional derivative can be used to 
find locally optimal solutions. 


4.1 Continuity and Differentiability 

In this subsection we explore the continuity and differentiability properties of the second eigenvalue 
of the graph Laplacians, viewed as a function of the projection angle, 0. We will view the data set 
X as a d X N matrix with i-th column equal to Xi, and similarly the projected data set as an I x N 
matrix, P{0) = H(0)^Af, with f-th column p{0)i. 

Lemma 1 Let X = {xi, ...,X]s[} C and let s{P,i,j) be Lipschitz continuous in P G for fixed 

i, j G {1 ... N}. Then X2{L{0)) and A 2 (Lnorm(^)) are Lipschitz continuous in 0. 

Proof We show the case of L{0), where that of Lnorm(^) is similar. The result follows from the fact 
that L{0) is element-wise Lipschitz as a composition of Lipschitz functions {V{0) is Lipschitz in 0 as 
a collection of products of Lipschitz functions) and the fact that 


\XfiLi0))-XfiL{0'))\<\\L{0)-Li0')\\<N 


max \L{0) — L{0')\ 
b 


y > 


191II. 


where the first inequality is due to Weyl (1912), and the second comes from Schur’s inequality (Schur 


Rademacher’s theorem therefore establishes that both objectives are almost everywhere differen¬ 
tiable (Polak 1987). This almost everywhere differentiability can also be seen by considering that 


simple eigenvalues of real symmetric matrices are differentiable, e.g. [Magnus (1985), and establishing 
that under certain conditions on the function s the eigenvalues of L(0) and Lnorm(^) are simple for 
almost all 0. 


Tao and Vu (2014) have shown that the real symmetric matrices with non-simple spectrum lie in 


a subspace of co-dimension 2. If we denote the space of real valued N x N symmetric matrices by 
iSjv, and denote this subspace by S, then Sn \ 5” is open and dense in Sn- Sufficient conditions on the 
function s for the almost everywhere simplicity of X 2 {L{ 0 )) (resp. A 2 (Lnorm(^))) are therefore that it is 
continuous in P for each i,j and for all B G Sn and U open in Q,30 G U s.t. ti'ace{L{0)B) ^ 0 (resp. 
trace(Lnorm(^)H) ^ 0). Continuity of s ensures continuity of the functions X2{L{0)) and A 2 (Lnorm(^)), 
and therefore the openness of the preimage of Sn\S. The latter condition ensures that for each open 
U C Q, the span of the image of U under A 2 (-) is Sn- Therefore, in every open U C <d,30 G U s.t. 
L{0) ^ S. Therefore the pre-image of Sn \ S' is dense in 0. 


Generalised gradient based optimisation methods are the natural framework for finding the optimal 
subspace for spectral bi-partitioning. Eigenvalue optimisation is, in general, a challenging problem due 
to the fact that eigenvalues are not differentiable where they coincide. The majority of approaches in 
the literature focus on the problems of minimising the largest eigenvalue or the sum of a predetermined 
number of largest eigenvalues (Overton and Womersley, 19931. Both of these problems tend to lead to a 
coalescence of eigenvalues, making the issue of non-differentiability especially problematic. Conversely 
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the minimisation of the smallest eigenvalue tends to lead to a separation of eigenvalues, and so non¬ 
differentiability is less of a concern ( Lewis and Overton[ 1996| ). 

If the similarity function s is strictly positive, then X 2 {L{ 6 )) and A 2 (Lnorm(^)) are bounded away 
from zero, and hence minimising these has the same benefits as does minimising the smallest eigen¬ 
value in general, in that the corresponding optimisation tends to separate them from other eigenvalues. 
Despite this practical advantage, the simplicity of \ 2 [L{ 6 )) and A 2 (Lnorm(^)) is not guaranteed over 


the entire optimisation. We discuss a way of handling points of non-differentiability in Section 4.2 


This approach uses the directional derivative formulation given by Overton and Womersley (19931, 


and allows us to find descent directions which also tend to lead to a decoupling of eigenvalues. 


Global convergence of gradient based optimisation algorithms relies on the continuity of the deriva¬ 
tives (where they exist). To establish this continuity, we first derive expressions for the derivatives of 
X 2 {L{ 0 )) and A 2 (Lnorm(^)) as a function of 0. Theorem 1 of Magnus (1985) provides a useful formu¬ 
lation of eigenvalue derivatives. If A is a simple eigenvalue of a real symmetric matrix M, then A is 
infinitely differentiable on a neighbourhood of M, and the differential at M is given by 

dX = d{M)u, (13) 

where u is the corresponding eigenvector. Let us assume that s{P,i,j) is differentiable in P G 
for fixed i,j G {1... N}. For brevity we temporarily drop the notational dependence on 0 and denote 
the second eigenvalue of the Laplacian by A, and the corresponding eigenvector by u. The derivative 
DgX is given by the {d — 1) x I matrix with i-th column D^.X, where we consider the chain rule 
decomposition Dg^X = DpXDyPDg^V. Here D - is the differential operator. Since only the i-th 
column of V depends on 0i, and only the i-th row of P depends on Vi, this product can be simplified 
as Dg^X = Dp.XDviPiDg.Vi, where Pi is used to denote the i-th row of P, while Vi and 0i are, as 
usual, the i-th columns of V and 0 respectively. We provi de ex pressions for each of these terms below. 

We first consider the standard Laplacian L. By Eq. (13) we have dX = iiJd{L)u = vJd{D)u — 
vJd{A)u. Now, 


N 


dDii _ ^ dAij 


N 


dP, 


mn - , ^P-rr. 


^^ ds{P^t,j) ^ and 




dPrr 


dPrr 


ds{P,i,j) 

dPmn ’ 


and so, 


^A 


-T dL 1^^ ^ds{P,i,j) 

= U -TTT^U = - 2_^{Ui - Uj) - 


dP 

rr 




dP 

u± rr 


(14) 


(15) 


For the normalised Laplacian, Lnorm, consider first 

d(L„orm) = d(D-l/ 2 LD-l/ 2 ) 

= d(D-i/2)LI?-i/2 -H D-i/2d(D)D-i/2 _ D-^/^d{A)D-^/^ -t Ld{D-^/^). 

We will again use A and u to denote the second eigenvalue and corresponding eigenvector. Using the 
fact that LD~^/'^u = XD^/'^u, 

dX = d[D-^/‘^)LD-^/‘^uPu^ D-^/‘^d{D)D-^/‘^u-vJ D-'^/‘^d(A)D-^/‘^u 




-1/2, 


= XvJd{D-^/'^)D^/'^u + vJD-^/'^d{D)D-^/'^u - vJD-^/'^d{A)D-^ 

PXvJ D^/^d{D-^/'^)u 

= Xu^d{I)u -f (1 - X)u^D-^/'^d{D)D-^/'^u - D-^/‘^d{A)D-^/'^u, 

since d{D-^l'^)DD-^/‘^ + D-^/‘^d{D)D-^/'^ + D-i/^Dd)!!-^/^) = ^(D-i/^DD-i/S) = d{I) = 0, we 
have, 

dX = {l-X)u^ D-^/‘^d{D)D-^/‘^u-u^ D-^/‘^d{A)D-^/'^u 
= u^D-^/'^d{L)D-^/'^u - Xu^D-^/'^d{D)D-'^/'^u. 
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Therefore, 


dX 


1 


= E 


Ut 


ds{P,i,j) ds{P,i,j) 


(16) 


dPmn ^/dj ) dPmn 

The component DviPi is simply the N x d transposed data matrix, and the dx (d — 1) matrix, Dg.Vi, 
is given by 


Dey^ = 


-sin(0ii) 
cos(0h) cos(02*) 


0 

- sin(0H) sin(02*) 


cos(0H)nfc=2sin(0fc») cos(02*)nfc^2sin(0fc*) ■ 

Having derived expressions for the derivatives of X2{L{0)) and A 2 (Tnorm(^)), we can address their 
continuity properties. The components DyPDgV clearly form a continuous product in 9. The conti¬ 
nuity of the elements dXjdPmn can be reduced to addressing the continuity of the eigenvalue itself, of 
its associated eigenvector and a continuity assumption on the derivative of the function s. It is well 
known that the eigenvalues of a matrix are continuous, while the continuity of the elements of the 


cos(0d_i,j)nLi sin(0fcj) 


• (17) 


eigenvector come from the fact that we have assumed A to be simple (Magnus 1985). We provide full 
expressions for the derivatives of X2{L{6)) and A 2 (Tnorm(^)), for the similarity function used in our 
experiments, in Appendix A. 


The eigenvalues of a real symmetric matrix can be expressed as the difference between two convex 
matrix functions ( Fan[ 1949[ ). If the similarity function, s, is Lipschitz continuous and differentiable we 
therefore have that \2{L{9)) and A 2 (Tnorm(^)) are directionally differentiable everywhere. Overton 
and Womersley (1993) describe a way of expressing the directional derivative of the sum of the k 


largest eigenvalues of a matrix whose elements are continuous functions of a parameter, at a point of 
non-simplicity of the /c-th largest eigenvalue. We will discuss the case of X2{L{9)), where A 2 (Tnorm(^)) 
is analogous. If we denote the sum of the k largest eigenvalues of L{0) by F^{0) then, 

X2{L{0))=F^-\0)-F^-‘^{0). (18) 

Now suppose that 0 is such that 

Xj^{L{0)) > ... > AAr_r+l(T(^)) > 

XN-r{L{0)) = ... = XN-k+l{L{0)) = ... = XM-r-t+l{L{0)) > 

XN-r-tiL{0)) > ... > Xi{L{0)). 

That is, the fc-th largest eigenvalue has multiplicity t and k — r are i ncluded in the sum defining F^( 0). 
Then the directional derivative of F^{0) in direction 9 is given by (Overton and Womersley, 1993) 


d-l i 


d-l i 


F^'{9- %trace(i?^Liji?) 
i=l i=i 


max 

ue<s>t,k- 


EE 0ijtiace{Q^ 


UjQU), 


(19) 


i=i j=i 


where Ly = dL{0)/d0ij, the matrix R G 


has j-th column equal to the eigenvector of the j-th 


largest eigenvalue of P(0) and the matrix Q G has j-th column equal to the eigenvector of the 

(r -I-j)-th largest eigenvalue of P(0). In addition the set <I>a,b is defined as, 

d>a,6 := {P G Sa\U and I — U are positive semi-definite and trace(17) = b}. (20) 

Overton and Womersley (1993) have shown that F^'{0; 9) is the sum of the eigenvalues of 
YlfiZi 9ijEpLijR plus the sum of the k — r largest eigenvalues of LijQ. There¬ 

fore, the directional derivative of X2{L{0)) in the direction 9 is given by the smallest eigenvalue of 
dijO^LijQ, where the matrix Q is constructed by any complete set of eigenvectors cor¬ 
responding to the eigenvalue A = X2{L{0))- 

























4.2 Minimising \ 2 {L( 6 )) and A 2 (i^norm(^))- 


Applying standard gradient descent methods to functions which are almost everywhere differentiable 
can result in convergence to sub-optimal points (Wolfe 1972). This occurs when the method for de¬ 


termining the gradient is applied at a point of non-differentiability and results in a direction which is 
not a descent. In addition, gradients close to points of non-differentiability may be poorly conditioned 
from a computational perspective leading to poor performance of the optimisation. 


The second eigenvalues of the graph Laplacian matrices, while not differentiable everywhere, ben¬ 
efit from the fact that their minimisation tends to lead to a separation from other eigenvalues. Thus a 
naive gradient descent algorithm tends to perform well. Notice also that if u S with |jM|| = 1 and 
w T 1 is such that vJL{0)u = X 2 {L{ 0 )) for some 0 G 0, then for any 0' G 0 with vJL{0')u < L{0)u 

we have \2{L{0')) < X2{L{0)), since L{0')u is an upper bound for X2{L(0')). Thus even if X2{L(0)) 
is a repeated eigenvalue, a descent direction for L{0)u is a descent direction for X 2 {L{ 0 )), where u 
is any corresponding eigenvector. However, this property does not necessarily hold for A 2 (Tnorm(^)) 
since the first eigenvector of Tnorm(^) depends on 0 , and thus the second eigenvector u will not nec¬ 
essarily be orthogonal to the first eigenvector of Tnorm(^A- 


We assume that the similarity function, s, is Lipschitz continuous and continuously differentiable 
in P for each i,j, and hence the Laplacian matrices L{0) and Lnorm(^) are element-wise Lipschitz 
continuous and continuously differentiable in 0. These conditions are sufficient for the everywhere 
directional differentiability of X 2 {L{ 0 )) and A 2 (Lnorm(^))- Our approach for finding locally minimal 
solutions for X 2 {L{ 0 )) and A 2 (Lnorm(^)) can be seen as a simplification of the general method of Over- 
ton and Womersley| (1993). Our method alternates between a naive application of a standard gradient 


based optimisation algorithm, in which the simplicity of the second eigenvalue is assumed to hold ev¬ 
erywhere along the optimisation path, and a descent step which (in general) decouples the second 
eigenvalue. We again discuss only X 2 {L{ 0 )) explicitly, where A 2 (Lnorm(^)) is analogous. A description 
of the method is found in Algorithm Notice that upon convergence of a gradient descent algorithm 
which assumes the simplicity of X 2 {L{ 0 )), if X 2 {L{ 0 )) is simple then the solution is a local minimum, 
and so the algorithm terminates. If X 2 {L{ 0 )) is not simple, then the solution may or may not be a local 
minimum. As we discuss in Section 4.1 if 0 is such that X2{L(0)) is not simple, then the directional 
derivative of X 2 {L{ 0 )) in direction 6 is given by the smallest eigenvalue of 

where Q is the matrix with columns corresponding to a complete set of eigenvectors for A = X 2 {L{ 0 )), 
and Lij = dL{0)/d0ij. If LijQ = 0 for all i = l,...,(i— l;j = 1,...,?, then @ is a local mini¬ 
mum and the method terminates, otherwise 30 G 0 s.t. Ai LijQ^ < 0, and thus 

0 is a descent direction for X 2 {L{ 0 )). It is possible to find a locally steepest descent direction by 
minimising Ai (py Efci Ei=i over 0 , however the added computational cost associ¬ 

ated with this subproblem outweighs the benefit over a simply chosen unit coordinate vector. Notice 
that the directional derivative of Xk+ 2 (L{ 0 )) in direction 0 is given by the {k + l)-th eigenvalue of 


J2i=i dijQ^LijQ, for fc = 0,1,..., t— 1, where t is the multiplicty of the eigenvalue A = X 2 {L{ 0 )). 
Therefore if there exists i G {1,..., d — 1}, j G {1,..., 1} s.t. Xt{Q^LijQ) > 0 and is simple then —etj, 
where —Cij is the (d — 1 ) x I matrix with zeros except in the z, j-th entry where it takes the value 1 , 
is a descent direction and dy > 0 s.t. X 2 {L {0 — ^eij)) < X'i{L{0 — 'y'cij)) for all 0 < 7 ' < 7 . On the 
other hand if Xi{Q^LijQ) < 0 and is simple, then is such a descent direction. If no such pair i,j 
exists, then we select i,j which maximises max{Xt{Q^LijQ),—Xi{Q^LijQ)} and set 0 = —Cij if the 
maximum was determined by the largest eigenvalue and equal to e^- otherwise. 
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Algorithm 1: Minimising X2(L{9)) 

1. Initialise 6. 

2. Apply gradient based optimisation to X2{L{6)) assuming differentiability 

3. if X 2 {L{d)) is simple then 
return 9 

4. Find Q G a complete set of t eigenvectors for eigenvalue A = X2{L{9)). 

Find Lij = dL(9)/d9ij for i = 1,..., d — 1, and j = 1, I 

5. if LijQ = 0 Vi = 1,..., d — 1; j = 1, I then 
return 9 

6. if 3i G {1, ..., d — 1}; j G {1,/} s.t. Xt{Q^LijQ) > 0 and is simple then 
9 a-rgming,X2{L{9')) s.t. 9' = 9 — 'ycij, 7 > 0, X2{L{9')) is simple 

go to 2. 

7. if 3i G {1, ..., d — 1}; j G {1, ..., 1} s.t. Xi{Q^LijQ) < 0 and is simple then 
9 c- argming/A 2 (F(@')) s.t. 9' = 9 + "fCij, 7 > 0, X2{L{9')) is simple 

go to 2. 

8 . (J, J) ^ argmax(^_^) m.a■i^{Xt{Q^L^Q), -Xi{Q^L^Q)} 

9. if Xt{Q^LijQ) > -Xi{Q^LjjQ) then 

9 argming/A 2 (F( 0 ')) s.t. 9' = 9 — 'yejj, 7 > 0 

go to 4. 

10 . 9 argming,X2{L{9')) s.t. 9' = 9 + 'yeij, 7 > 0 

go to 4. 

11. end 


4.3 Computing Similarities 

It is common to define pairwise similarities of points via a decreasing function of the distance between 
them. That is, for a decreasing function k : K’*' —>■ IR"'', the similarity function s may be written, 


s{P,i,j) = k 


( d{pi,Pj)\ 

I ^ r 


( 21 ) 


where d(-, •) is a metric and cr > 0 is a scaling parameter. We have found that the projection pursuit 
method which we propose can be susceptible to outliers when the standard Euclidean distance metric 
is used, especially in the case of minimising X2{L{9)). In this subsection we discuss how to embed a 
balancing constraint into the distance function. By including this balancing mechanism the projection 
pursuit is steered away from projections which result in only few data being separated from the 
remainder of the data set. 

While the normalisation of the graph cut objective, given in ([^, is extremely effective in emphasis¬ 
ing balanced partitions in the general spectral clustering problem ( |von Luxburg 20071, we have found 
that in the projection pursuit formulation a further emphasis on balance is sometimes required. This 
is especially the case in high dimensional applications. Consider the extreme case where d > N. Then 
the projection equation, X = P, is an underdetermined system of linear equations. Therefore for 
any desired projected data set P there exist @ G 0,c G K \ {0} s.t. V{9)^X = cP. In other words 
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the projected data can be made to have any distribution, up to a scaling constant. In particular we 
can generally find projections which induce a suSicient separation of a small group of points from the 
remainder of the data that the normalisation in ([^ is inadequate to obtain a balanced partition. We 
have observed that in practice even for problems of moderate dimension this situation can occur. The 
importance of including a balancing constraint in the context of projection pursuit for clustering has 
been observed previously by Zhang et al. (20091 and Pavlidis et al. (2016). 

Emphasising balanced partitions is achieved through the use of a compact constraint set A, which 
may be defined using the distribution of the projected data set P. By defining the metric d(-,’) in 
such a way that distances between points extending beyond A are reduced, we increase the similarity 
of points outside A with others. If P is Z dimensional then we define A as the rectangle A = 
where each A^ is an interval in K which is defined using the distribution of the j-th component of 
P. A convenient way of increasing similarties with points lying outside A is with a transformation 
Ta : —>■ K*, defined as follows. 


'7A(y) = (tAi(yi), ...,tA,(2/i)) > 



(22) 

1 

f ^min Ai-2:+((j (1-h)) 

1—5 1-5 

+5(5(l-5))^. 

z<min Ai 


tA.(z):= j 

z-minAj, 

[ 5 — max Ai + (i5 (1 —(5))^^ 

— 6{S (1 —5))^~ +Diam(Ai), 

zeAi 

z>max A A 

(23) 


where S € (0, .5] is the distance reducing parameter. Each is linear on A^ but has a smaller 
gradient outside A^. As a result we have ||rA(a;) — T'a(?/)|| < Hx — y\\ for any x,y G M}, with strict 
inequality whenever either x or y lies outside A. We define Ta in this way so that it is continuously 
differentiable even at the boundaries of A, and so does not affect the differentiability properties of the 
similarity function, s. Figure [^illustrates how the function Ta influences distances and similarities in 
the univariate case. 

In the context of projection pursuit it is convenient to define a full dimensional convex constraint 
set A C and define the univariate constraint intervals, which we now index by the corresponding 
projection angles, via the projection of A onto each V{6)i. That is, 

Agj := [min{E(0)7 x\x G A}, max{I/(0)7 x\x G A}] . (24) 

In our implementation, we define A to be a scaled covariance ellipsoid centered at the mean of the 
data. The projections of A are thus given by intervals of the form, 

= + (25) 


where ygi and ugi are the mean and standard deviation of the z-th component of the projected data 
set P{d) and the parameter /3 > 0 determines the width of the projected constraint interval A^i. 

Figurej^shows two dimensional projections of the 64 dimensional optical recognition of handwritten 
digits dataselQ The leftmost plot shows the PCA projection which is used as initialisation for the 
projection pursuit. The remaining plots show the projections arising from the minimisation of X 2 {L{ 6 )) 
for a variety of values of j3. For /3 = oo, i.e., an unconstrained projection, the projection pursuit focuses 
only on a few data and leaves the remainder of the dataset almost unaffected by the projection. Setting 
P = 2.5 causes the projection pursuit to focus on a larger proportion of the tail of the data distribution. 
Setting P = 1.5 however allows the projection pursuit to identify the cluster structure in the data and 
find a projection which provides a good separation of three of the clusters in the data, i.e., those 
shown as black, orange and turquoise in the top right plot. 


4.4 Correlated and Orthogonal Projections 

The formulation of the optimisation problem associated with projection pursuit based on spectral 
connectivity places no constraints on the projection matrix, V, except that its columns should have 

^https://archive.ics.uci.edu/ml/datasets.html 
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Figure 1: Effect of Ta on Distances and Similarities. 


A 




(a) 


(b) 



(c) (d) 

(a) The univariate data set P is plotted against the transformed data T^{P). The point at ^ 15 lies outside A and its 
distance to other points, e.g. the point at Ri 2, is smaller within T^{P) (vertical axis) than in P (horizontal axis), (b) 

The kernel density estimate of the transformed data T^{P) (-) has a stronger bimodal structure, i.e., two well 

defined clusters, than that of P (-), which has multiple small modes caused by outliers. The connection between 

spectral connectivity and density based clustering has been investigated theoretically by [Narayanan et ah] | |2006| l 
showing that the normalised graph cut is asymptotically related to the density on the separating surface, (c) The 
affinity matrix of the data set P has a weaker cluster structure than that of T^(P), shown in (d). 


unit norm. A common consideration in dimension reduction methods is that the columns in the 
projection matrix should be orthogonal, i.e., Vj =0 for all i^j. In the context of projection 
pursuit it is common to generate the columns of V individually, so that orthogonality of the columns 
can easily be enforced. Huber (1985) proposes first learning Vl using the original data, and then for 
each subsequent column the data are first projected into the null space of all the columns learnt so far. 
An alternative approach (Niu et al. 20111, is to instead project the gradient of the objective into this 
null space during a gradient based optimisation. Notice, however, that orthogonality in the columns 
of V is not essential for the underlying problem. Another common approach (Huber 1985) is to 


transform the data after each column is determined in such a way that the columns learned so far are 
no longer “interesting”, i.e., have low projection index. This does not enforce orthogonality, but rather 
steers the projection pursuit away from the columns already learned by making them unattractive to 
the optimisation method. 

We propose a different approach which allows us to learn all of the columns of V simultaneoulsy. 
This is achieved by introducing an additional term to the objective function which controls the level 
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Figure 2: Two dimensional projections of optical recognition of handwritten digits dataset arising 
from the minimisation of X 2 {L{ 6 )), for different values of /3. In addition, the initialisation through 
PCA is also shown. The top row of plots shows the true clusters, while the bottom row shows resulting 
bi-partitions. 



of orthogonality, or alternatively correlation, within the projection matrix. In particular, we consider 
the objective, 

min A 2 (L W) + u;Y,{V{ 0 )J (26) 


or replacing X 2 {L{ 6 )) with A 2 (Tnorm(^)) in the normalised case. This approach serves a dual purpose. 
In the first case, setting w > 0 leads to approximately orthogonal projections, without the need to 
optimise separately over different projection vectors as is standard. Alternatively, setting w < 0 leads 
to approximately perfect correlation, i.e., V{9)i ±V{9)j for all i,j. In the latter case the result¬ 

ing projection is therefore equivalent to a univariate projection. This is similar to simultaneously 
considering multiple initialisations, and allowing the optimisation procedure to select from them au¬ 
tomatically. This is important as the objectives X 2 {L{ 9 )) and A 2 (Tnorm(^)) are non-convex, and as a 
result applying gradient based optimisation can only guarantee convergence to a local optimum. 


Notice also that the formulation in Eq. (26) offers computational benefits over the alternative 


of optimising separately over each projection vector, since the eigenvalues/vectors computed in each 
function and gradient evaluation need only be computed once for each iteration over the multiple 
projection dimensions. 


5 Connection with Maximal Margin Hyperplanes 


In this section we establish a connection between the optimal univariate projection for spectral bi¬ 
partitioning using the standard Laplacian and large margin separators. In particular, under suitable 
conditions, as the scaling parameter tends to zero the optimal projection for spectral bi-partitioning 
converges to the vector admitting the largest margin hyperplane through the data. This establishes 
a theoretical connection between spectral connectedness and separability of the resulting clusters 
in terms of Euclidean distance. Large margin separators are ubiquitous in the machine learning 
literature, and were first introduced in the context of supervised classification via support vector 


machines (SVM, Vapnik and Kotz (1982)). In more recent years they have shown to be very useful 
for unsupervised partitioning in the context of maximum margin clustering as well (Xu et al. 2004 
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Zhang et al., 20091. 


Our result pertains to univariate projections, and therefore the d x 1 projection matrix is equiv¬ 
alently viewed as a projection vector in We therefore use the notation v{d), instead of V{d) as 
before. 


The result holds for all similarities for which the function k, in Eq. (21 1 , satishes the tail condition 


lim 2 ;_>oo k{x + e)/k{x) = 0 for all e > 0. This conditions is satisfied by functions with exponentially 
decaying tails, including the popular Gaussian and Laplace kernels. It is, however, not satisfied by 
those with polynomially decaying tails. 

The constraint set A again plays an important role as in many cases the largest margin hyperplane 
through a set of data separates only a few points from the rest, making it meaningless for the purpose of 
clustering. We therefore prefer to restrict the hyperplane to intersect the set A. What we in fact show 
in this section is that there exists a set A' C A satisfying A' D X — AO X, such that, as the scaling 
parameter tends to zero, the optimal projection for \2{L{9)) converges to the projection admitting the 
largest margin hyperplane that intersects A'. The distinction between the largest margin hyperplane 
intersecting A' and that intersecting A is scarcely of practical relevance, but plays an important role 
in the theory we present in this section. It accounts for situations when the largest margin hyperplane 
intersecting A lies close to its boundary and the distance between the hyperplane and the nearest 
point outside A is larger than to the nearest point inside A. Aside from this very specific case, the 
two in fact coincide. 

A hyperplane is a translated subspace of co-dimension 1, and can be parameterised by a non-zero 
vector V { 0 } and scalar b as the set H{v, b) = {x G = b}. Clearly, for any c G K \ {0}, one 

has H{v, b) = H(cv, cb), and so we can assume that v has unit norm, thus the same parameterisation 
by 0 can be used. For a finite set of points X C the margin of hyperplane H{v{0), b) w.r.t. X is 
the minimal Euclidean distance between H{v{0),b) and X. That is. 


margin(u( 0 ), b) = min |u( 0 ) x — b\. 

xGX 


(27) 


Connections between maximal margin hyperplanes and Bayes optimal hyperplanes as well as mini¬ 
mum density hyperplanes have been established (Tong and Roller 2000| Pavlidis et al. 2016). 


In this section we will use the notation X = {v^xi, xn}, and for a set Pcffi and y G K 
we write, for example, P^y for Pn {y,oo). For scaling parameter ct > 0 and distance reduction fac¬ 
tor (5 > 0 define 6cr.s ■= argminggQA 2 (L( 0 , a, < 5 )), where L{0, a, S) is as L{6) from before, but with an 
explicit dependence on the scaling parameter and distance reducing parameter used in the similarity 
function. That is, defines the projection generating the minimal spectral connectivity of X for a 
given pair cr, S. 


Before proving the main result of this section, we require the following supporting results. Lemma 
provides a lower bound on the second eigenvalue of the graph Laplacian of a one dimensional data 
set in terms of the largest Euclidean separation of adjacent points, with respect to a constraint set 
A. This lemma also shows how we construct the set A'. Lemma [s] uses this result to show that 
a projection angle 0 G 0 leads to lower spectral connectivity than all projections admitting smaller 
maximal margin hyperplanes intersecting A' for all pairs cr, S sufficiently close to zero. 

Lemma 2 Let k : IR+ -G IR_|_ be a non-increasing, positive function and let a > 0,6 G (0,0.5]. Let P = 
{pi, ...,pn} be a univariate data set and let A = [a, b] for a < 5 G K. Suppose that |P n A| > 2 and a > 
min{P}, b < max{P}. Define A' = [a', b'], where a' = {a min{P n A})/2, &' = (6 -|- max{P n A})/2. 
Let M = maxa;gA'{iiiinj=i..jv |a; — pi|}. Define L{P) to be the Laplacian of the graph with vertices 
P and similarities according to s{P,i, j) = k{\T/^{pi) — T/^{pj)\/a). Then A 2 (P(P)) > j^pfc((2M-|- 
SC)/a-), where C = max{D, D^~^}, P = max{a — min{P}, max{P} — &}. 

Proof We can assume that P is sorted in increasing order, i.e. pi <Pi+i, since this does not affect 
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the eigenvalues of L{P). We first show that s{P, i,i + l)> k{{2M + SC)/a) for all i = l,N — 1. 
To this end observe that — —^(h(l — 5))< <5max{x, x^~^} for a; > 0. 


• lipi,pi+i < athens(P,j,i + 1) = fc((TA(pi+i) - TA{pi))/(j) > fc((TA(a) - TA{pi))/cr) > k{{2M + 
SC) I a) by the definition of C and using the above inequality, since k is non-increasing. The 
case Pi,Pi+i >b is similar. 

• If p^,Pi+l e A then p^,Pi+i G A' ^ \Pi - Pi+i \ <2.M ^ s{P,i,i + 1) > k{2M/a) > 
k{{2M + SC)/a) since M is the largest margin in A'. 


• If none the above hold, then we lose no generality in assuming pi < a, a < Pi+i < b since the case 
a < Pi <b, Pi+i > b is analogous. We must have pi+i = min{P n A} and so a' = {a + pi+i)/2. If 
Pi+i — a> 2M then minj=i..jv \a' — Pj \ > M, a contradiction since a' G A' and M is the largest 
margin in A'. Therefore pi+i — a< 2M. In all TA(pi+i) — T'a(k) = (pi+i — a) + S{a — Pi + 
(<5(1 - - <5(<5(1 -S))'-^ <2M + SC^ s{P, i, 5 -f 1) > fc((2M -t <5C)/cr). 

Now, let u be the second eigenvector of L{P). Then ||u|| = 1 and uPl and the refore 3i,j s.t . Uj — 


Ui > 




We thus know that there exists m s.t. 


m+l I ^ |p|3/2 • 


By (von Luxburg 


2007 


Proposition 1), we know that vJ L{P)u = | s(P, i,j){ui — Uj)^ > s(P, m,in + l)(itm — Um+i)^ ^ 


j^k{{2M -\-SC)la) since all consecutive pairs Pm, Pm+i have similarity at least k{{2M ■ 
by above. Therefore \ 2 {L{P)) > j^k{{2M + SC)/a) as required. 


■SC)/a), 


In the above Lemma we have assumed that A is contained within the convex hull of the points P, 
however the results of this section can easily be modified to allow for cases where this does not hold. 
In particular, if an unconstrained large margin hyperplane is sought, then setting A to be arbitrarily 
large allows for this. We have merely stated the results in the most convenient context for our practical 
implementation. 

The set A' in the above is defined in terms of the one dimensional constraint set [a, b]. We define 
the full dimensional set A' along the same lines by. 


A' := {a;GM'^|p(6l)^a;G A^ V^G©}, 

, [min Ae -I- min{r;(0)^A n Ag} max Ag-|-max{w(0)^A n A^} 


(28) 


Here we assume that A is contained within the convex hull of the (i-dimensional data set X. Notice 
that since A is convex, we have v{d)^A' = Ag. In what follows we show that as a and S are reduced to 
zero the optimal projection for spectral partitioning converges to the projection admitting the largest 
margin hyperplane intersecting A'. If it is the case that the largest margin hyperplane intersecting A 
also intersects A', as is often the case, although this fact will not be known, then it is actually not nec¬ 
essary that S tend towards zero. In such cases it only needs to satisfy S < 2M / C for the corresponding 
values of M and C over all possible projections. In particular, choosing max{Diam(A), Diam(A)^“'^} 
instead of C is appropriate for all projections. 


Lemma 3 Let @ G 0 and let k : M+ —>■ M+ he non-increasing, positive, and satisfy 

lim k{x{l€))/k{x) = 0 

X—¥00 

for all e > 0. Then for any 0 <m < max^gA^ niargin{v{ 6 ), b) there exists tr' > 0 and <5' > 0 s.t. 0 < 
a <a', 0 < S < S' and max^gA', rnargin{v{d'),c) < maxf,gA' margin{v{ 6 ),b) — m X 2 {L{ 6 ,a,S)) < 
X2{L{0',a,S)). 
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Proof Let B = argmaX{,g^^margin(ii(0), h) and M = margin(z;(0), B). We assume that M ^ 0, since 
otherwise there is nothing to show. Now, since spectral clustering solves a relaxation of the minimum 
normalised cut problem we have, 


Xj 


- 51 s{p{e),i,j) 

i,j:v{d)~'^ Xi<.B 
v{B)~^Xj '>B 


{\{v{eyx)^B\ 


\{v{evx)^B\ 


J_ V k ( TAMGV^o)-TAMoV^^) \ ( 1^1 

„ \ ^ )\\[v{0yxusmeyx)yB\ 

i,j\v{d) Xi<B 
v{9)~^Xj >B 


= k{2M/a). 


The hnal inequality holds since for any z,j s.t. v{0y Xi <Bandv{0yxj > i? we must have Ta^ 

i) > 2M. Now, for any 0' € 0, let Mg/ =maXcgA^, margin(n(0'), c). By Lemma we 
know that \2{L{0',a,5)) > p^fc((2Me/ +5C/(j), where C = max{Diam(X), Diam(X)^“^}. There¬ 
fore, 

_ \2{L{0,ay)) _ ^ |X|3fc(2M/a) 

cr-)-o+,5->-o+ infe/ge{A 2 (L(@', cr, 5))\Mbi <M — m} ~ cr->.o+.(5-j-o+ k{(2{M — m) -|- SC)/a) 

= 0 . 


This gives the result. 

Lemma shows almost immediately that the margin admitted by the optimal projection for 
spectral bi-partitioning converges to the largest margin through A' as a and 6 go to zero. The main 
result of this section, Theorem]^ shows the stronger result that the optimal projection itself converges 
to the projection admitting the largest margin. 


Theorem 4 Let X = {a:i,..., xat} and suppose that there is a unique hyperplane, which can be pa- 
rameterised by {v{0*), b*), intersecting A' and attaining maximal margin on X. Let k : IR_|_ —> K_|_ be 
decreasing, positive and satisfy lim 2 ,_).oo ^((1 + e)x)/k{x) = 0 for all e > 0. Then, 


lim v(0as) = r(0* 
cr^0+,5^0+ ' ’ " 


Pavlidis et al. 


Proof Take any e > 0. 

||(w,c)/||w|| - (■i;(0*),6*)|| > e^margin(w/||w||,c/||u; 
know 3a' > 0, i5' > 0 s.t. if 0 < cr < cr' and 0 < (5 < 5' then 3c G A, 
margin(n(0*), b*) — m^, since0^,5 is optimal for the pair a, S. Thus, by above, ||(n(0o..5), c) — {v{0*), 5*)| 
<e. But ||(u(0^,5),c) - (w(0*),6*)|| > - u(0*)|| for any c G K. Since e > 0 was arbitrary, we 

therefore have v{0a,5) —> v{0*) as cr, <5 —> 0+. 


(|2016|) have shown that > 0 s.t. for w G c G K, 
< margin(u(0*), 5*) — me- By Lemma we 
g s.t. margin(c( 0 CT ,s)x) > 


While the results of this section apply only for univariate projections, we have observed empirically 
that if a shrinking sequence of scaling parameters is employed for a multivariate projection, then the 
projected data tend to display large Euclidean separation. This is illustrated in Figure]^ which shows 
two dimensional plots of the 72 dimensional yeast cell cycle analysis dataselQ As in Figure the 

^ http://genome-www.Stanford.edu/cellcycle/ 
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Figure 3: Large Euclidean separation of the yeast cell cycle dataset. The left plots show the result 
from a 2 dimensional projection pursuit using the proposed method. The middle plots show the 
1 dimensional projection pursuit result. The right plots show the result of the maximum margin 


clustering method of Zhang et al. (2009). 



top plots show the true clusters in the data and the bottom plots show the clustering assignments. 
The left plots show the result of a two dimensional projection pursuit using the proposed method. 
In the middle plots the first projection is learnt using one dimensional projection pursuit, and the 
second is set to be the direction of maximum variance within the null space of the first projection. 
The right plots use as the first projection the result of the iterative support vector regression method 
for maximum margin clustering (Zhang et al. 2009), and again the second projection is the direction 
of maximum variance within its null space. 

A similar intuition which underlies the theoretical results of this section can be used to reason 
why this will occur in multivariate projections, in that as the scaling parameter reduces to zero the 
value of X2{L{0)) is controlled by the smallest distance between observations in different elements of 
the induced partition. It is however elusive how to formulate this rigorously in the presence of the 
constraint set A in more than one dimension. 


6 Speeding up Computation using Microclusters 


In this section we discuss how a preprocessing of the data using microclusters can be used to sig¬ 
nificantly speed up the optimisation process. We derive theoretical bounds on the error induced by 
this approximation. Our approach uses a result from matrix perturbation theory for diagonally dom¬ 
inant matrices, and therefore only applies to the standard Laplacian, L{6). However, we have seen 
empirically that a close approximation of the optimisation surface is obtained for both X2{L{6)) and 
X2{L 

norm iO))- _ 


The concept of a microcluster was introduced by Zhang et al. (19961 in the context of clustering 


very large data sets. Microclusters are small clusters of data which can in turn be clustered to generate 
a clustering of the entire data set. A microcluster like approach in the context of spectral clustering 


has been considered by Yan et al. (2009), where the authors obtain bounds on the mis-clustering 


rate induced by the approximation. Rather than using microclusters as an intermediate step towards 
determining a final clustering model, we use them to form an approximation of the optimisation 
surface for projection pursuit which is less computationally expensive to explore. The error bound 
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depends on the ratio of cluster radii to scaling parameter. As such, this method does not provide a 
good approximation when a is close to zero. Our bounds rely on the following result from perturbation 
theory. 


Theorem 5 

Let A= \a 
let X 


^ (200^ 


1 < A 2 < 


ijj and A = [dij] be two symmetric positive semidefinite diagonally dominant matriees, and 
< A„ and Ai < A 2 < ... < A„ be their respeetive eigenvalues. If, for some 0 < e < 1, 


\aij - dij\ < e\a.ij 


Vi 7 ^ j, and 1?;^ — Ui| < evi Vi, where Vi = an — similarly for Vi, then 


IAj A^I ^ eXi Vi. 

An inspection of the proof of Theorem reveals that e < 1 is necessary only to ensure that the signs 
of Qij are the same as those of dij. In the case of Laplacian matrices this equivalence of signs holds 
by design, and so in this context the requirement that e < 1 can be relaxed. 


In the microcluster approach, the data set X = {xi,...,a; 7 v}is replaced with K points Ci,...,ck which 
represent the centers of a AT-clustering of X. By projecting these microcluster centers during sub¬ 
space optimisation, rather than the data themselves, the computational cost associated with each 
eigen problem is reduced from 0{N’^) to 0{K'^). If we define the radius, p, of a cluster C to be the 
greatest distance between one of its members and its center, that is. 


p{C) = max 

x£C 


1 

[q 


E- 

xec 


(29) 


then we expect the approximation error to be small whenever the microcluster radii are small. The 
bounds on the approximation error which we present in this section are worst case and rely on standard 
eigenvalue bounds, and so can be pessimistic. To obtain a reasonable bound on the approximation 
surface, as many as AT« 0.6N might be needed, leading to only a threefold speed up. We have 
observed empirically, however, that even for K — O.IN (and sometimes lower) one still obtains a close 
approximation of the optimisation surface. This makes the projection pursuit of the order of 100 times 
faster. 


Lemma 6 LetC = Ci, ...,Ck be a K-elustering of X with centers ci, ...,Ck , radii pi,px and counts 
ni,...,nx. For 0 G Q define N (9), B{0) where N{0) is the diagonal matrix with 

K 


and 

B{e),j = ,yfqn~s{P‘'{e),i,j), 

where P‘^{9) = {V{0)^ci ,..., V{6)^ck} are the projected microeluster centers and the similarities are 
given by s{P‘^{0), i,j) = k{\\TAg (V (9)^Ci) — (V {Q)^ Cj)\\ / a) , and k{x) is positive and non-increasing 

for x > 0. Then, 


\X2{L{e))-X2{N{e)-B{e))\ 

Mm) 


< max max 


k{Dij/a) _ 

k{{D,j - Pt- Pj)+/ay 

fcCAj/q-) __ 

k{{D,j A piP pj)/a) 


where Dij = \\TAg{V{9)^Ci) — TAg{V{9)^Cj)\\ and (a;)+ =max{0 ,x}. 
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Proof For brevity we temporarily drop the notational dependence on 6. Let P°' = {V^ci, 

ck,V^ck}, where each Ci is repeated times. Let L‘^' be the Laplacian of the graph with 
vertices P^' and edges given by s{P^',i,j). We begin by showing that X2{L‘^') = X2{N — B). Take 
V G then, 

v^{N - B)v = '^s{P‘^,i,j){vfnj - ViVj^rnnj) 

= ]^^s{P‘^,i,j){v‘fnj +Vjni - 2viVjy/ninj) 
id 

> 0 , 


and so N — B is positive semi-definite. In addition, it is straightforward to verify that {N — B){yXni ... 
y/nx) = 0, and hence 0 is the smallest eigenvalue oi N — B with eigenvector (^/rii ... ^nx)- Now, 
let u be the second eigenvector of L^'. Then Uj = Uk for pairs of indices j, k aligned with the same 
V^Ci in P‘^'. Define G s.t. u‘[ = y/niuj where index j is aligned with I^^Ci in Pj'. Then 
(u^)^ {-^/rii ... y/nx) = where index ji is aligned with Ci in Pj:' for 

each i. Therefore n^Uj^ = J2j:P‘='=v^ci hence ... y/nx) = J2j-.p'?'=v~^ci = 

53^1 = 0 since 1 is the smallest eigenvector of L‘^' and so u T 1. Similarly = 

53^1 d = 1- Thus T (vTfi ■ • ■ y/nx) and ||u‘^|| = 1 and so is a candidate for the second eigenvector 
oi N — B. In addition it is straightforward to show that {N — B)u‘^ = u ■ L'^'u. Now, suppose by 
way of contradiction that 3w T {^/rii ... ^/rix) with ||r(;|| = 1 s.t. {N — B)w < {N — B)u‘^. 
Then let w' = {wi/^/ni ... wx/y/nx) where each Wij^/ni is repeated rii times. Then 

||w'|| = 1, (w')^! = {y/ni ... y/nx) = 0 and L^'w < vJL^'u, a contradiction since u is the sec¬ 
ond eigenvector of 

Now, let i,j,m,n be such that Xm G Ci and G Cj. We temporarily drop the notational depen¬ 
dence on A. Then, 

||T(y^a:„) - T{V^Xn)\\ = \\TiV^x^) - T{V^a) + T{V^a) - T{V^c,) 

< \\T{V^x^) - T{V^Ci)\\ + \\T{V^Ci) - T{y^c,)\\ 

^\\T{V^c,)-T{V^x^)\\ 

< Pi + Pj + Dij , 


since T contracts distances and pi and pj are the radii of Ci and Cj. Since k is non-increasing we 
therefore have. 


fe(Ai/o') ^_ fe(Aj/g) _^ fe(Aj7g) 

k{{Dij - Pi- Pj)'^ /cr) ~ k{\\T{V^x„r) - T{V^ Xn)\\/cr) ~ k{{D^j + pi + Pj)/a) 

,_ fc(Aj-/q-) _^_ fe(Aj/g) 

k{\\T{V^x^) - T{V^x^)\\/a) - k{{D,j - p, - pj)+/a) 

and 

_ fe(Aj/g) _^ ^ k{D,j/a) _^ 

k{\\T{V^Xjn)-T{V^Xn)\\/a-) ~ k{{Dij + pi + pj)/a) 

Therefore 


k{D,j/a) 


k(\\T{V^Xm)-T{V^x^)\\/a) 


- 1 


< max si — 


k{Dij/a) 


k{{D,j- Pi- pj)+l a) 
KDrj/(y) 


k{{D^j + pi + Pj)/(j) 


- 1 
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Now, we lose no generality by assume that X is ordered such that for each i the elements of cluster 
Ci are aligned with Ci in P'^', since this does not affect the eigenvalues of the Laplacian of V^X, 
L. By the design of the Laplacian matrix the “vi^ of Theorem are exactly zero. For off diagonal 
terms m, n with corresponding i, j as above, consider 

\Lmn - |fc(A,/a) - kiWTjV^x^) - T{V^x^)\\/a)\ 

\L^u\ k{\\T{V^x^)-T{V^Xn)\\/a) 

__ fcCAj/q-) __ _ 

k{\\T{V^x^)-T{V^Xn)\\/a) 

Theoremthus gives the result. 


The above bound depends on 6 via the quantity Dij and for some functions k it is difficult to 
remove this dependence. We consider the class of functions, parameterised by a > 0, and given by 


k{x) 



exp(-|a:|), 


(30) 


where we adopt the convention (§)° = 1 for any a G K. For a = 0 this is equivalent to the Laplace 
kernel, but for a > 0 has the useful property of being differentiable at 0. We have found the choice 
of k to matter little in the results of the proposed approach. The above class of functions is chosen 
as it allows us to obtain a uniform bound on the error induced by the above approximation. Note 
the parameter a is not intended as a tuning parameter, but rather we set a close to zero to obtain a 
function similar to the Laplace kernel, but which is differentiable at zero. 

Corollary 7 Let the conditions of Lemma hold, and let k(x) = (|a;|/a + 1)“ exp(—|a:|) for a > 0. 
Then, 


\X 2 {l{B)) - X2iN{e) - B{e))\ 

X2im) 

Proof Firstly, consider 

k{Dij/a) 


f Diam{X) + aa 
\ Diam[X) + Pi + pj + era 


< max 


exp 


fc((A, 




-1 = 




Dij + Pi + Pj + era 


exp 


Pj 


Pi + Pj 


- 1 . 


- 1 . 


Now, the function ^ non-decreasing in x for x,y,a,<7 > 0, therefore by above 


k{Dij/a) 


- 1< 


k{{Dij + pi + Pj)/a) 

Secondly, consider the case Dij > Pi + Pj , then 
k{Dij/er) 


/ Diam(X) -|- tra 
\Diam(X) + pi + pj -I- era 


exp 


1 - 


k{{D,j - Pi - pj)+/a) 


= 1 - 


< 1 - 


Dij + aa 


Dij Pi Pj ~\~ a a 
^Diam(X) + pi + Pj + aa 


exp - 


Pi + Pj 


Pi + Pj 


- 1 . 


exp 


Pi + Pj 


Diam(X) + aa 

since exp(?//tT) is non-increasing in x for x, y,a,a > 0. On the other hand, if Dij < pi + pj 

then. 


1 - 


k{D,jla) 


k{{Dij - p,- pj)+/a) 


= l-k 


D 


<l-k 


Pi + Pj 


= 1 - 


< 1 - 


Pi + P] + era 


aa 


exp - 


Pi + Pj 


/ Diam(X) + pi + pj -I- era 


Diam(X) -|- era 


exp - 


Pi + Pj 
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where the first inequality comes from the fact that Dij < pi + Pj and k is decreasing. Now, using the 
identity 1 — ^ < x — 1 for a; 0, we have 


^ / Diam(X)+ o-a \" ^ / Pi + Pj \ ^ 

\ Diam(X) + era ) ^ \ a ) ~ 

/ Diam(X) + aa 
\Diam(X) + pi + pj + aa 

and so Lemmagives the result. 


exp 


Pi + Pj 


- 1 , 


Tighter bounds can be derived if pairwise distances between elements from pairs of clusters are com¬ 
pared directly to the distances between the cluster centers, and for higher dimensional cases the 
additional tightness can be significant. We prefer to state the result as above due to the sole reliance 
on the internal cluster radii relative to scaling parameter. 

While bounds of the above type are not verifiable for Lnorm due to the fact that it is not diagonally 
dominant, a similar degree of agreement between the true and approximate eigenvalues has been ob¬ 
served in all cases considered. In this case the K x K matrix is given by the normalised Laplacian of 
the graph of P^{9) with similarities given by ninjs{P^j). This matrix has the same structure 
as the original normalised Laplacian, the only difference being the introduction of the factors n^, rij. 


Figurej^shows (a) \ 2 {L{ 6 )) and (b) A 2 (Lnorm(^)) plotted against the single projection angle 0 for the 
2 dimensional SI data set ( Franti and Virmajoki[ 20061. The parameter a was chosen using the same 
method as for our experiments. A complete linkage clustering was performed for 3000 microclusters 
(= 60% of total number of data), as well as for 200 microclusters for comparison. The true values of 
X 2 {L{ 9 )) and those based on approximations using 3000 microclusters are almost indistinguishable. 
The approximations based on 200 microclusters also show a good approximation of the optimisation 
surface, and lie well within the bounds pertaining to the 3000 microcluster case. The same sort of 
agreement can be seen for A 2 (Lnorm(^))- Importantly, while the approximations based on 200 micro¬ 
clusters slightly underestimate the true eigenvalues, the location of the local minima, and indeed the 
shape of the optimisation surface, are very similar to the truth, and so optimising over this approxi¬ 
mate surface leads to near optimal projections. We also show the absolute relative error, (c) and (d), 
as described in Lemma^ The pessimism of the bound is clearly evident in the bottom left plot where 
the values of l'*' 2 (.L(g))-^^^(JV(e)~B(g))| g^ppgg^j. ygpy close to zero on the scale of the theoretical bound. 


7 Experimental Results 


In this section we evaluate the proposed method on a large collection of benchmark datasets. We 
compare our approach with existing dimension reduction methods for clustering, where the final 
clustering result is determined using spectral clustering. In addition we consider solving our problem 
iteratively for a shrinking sequence of scaling parameters to find large margin separators, relying on 
the theoretical results presented in Section [Sj We compare these results with the iterative support 


(2009[p a state-of-the-art maximum margin clustering 


vector regression approach of Zhang et al. 
algorithm. 

We compare the different methods based on two popular evaluation metrics for clustering, namely 
purity ( Zhao and Karypis| 2004) and IZ-measure (Rosenberg and Hirschberg 2007). Both metrics 
compare the label assignments made by a clustering algorithm with the true class labels of the data. 
They take values in [0,1] with larger values indicating a better agreement between the two label sets, 
and hence a superior clustering result. Purity is the weighted average of the largest proportion of 


^We are grateful to Dr. Kai Zhang for supplying us with code to implement this method. 
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Figure 4: Approximation Error Plots for SI data set. 




(a) X 2 {L{ 6 )) with Theoretical Error Bounds (b) A 2 (Tnorm(^)) 




(c) with Theoretical Bound (d) l^ 2 (LnormW)-A| 

^ ' A2tnnor-m(®)) 

True eigenvalue (-), bounds based on 3000 microclusters (-), approximation using 3000 microclusters (-), 

approximation using 200 microclusters (-) 


each cluster which can be represented by a single class, ^-measure is defined as the harmonic mean 
of measures of completeness and homogeneity. Homogeneity is similar to purity, in that it measures 
the extent to which each cluster may be represented by a single class, but is given by the weighted 
average of the entropy of the class distribution within each cluster. Completeness is symmetric to 
homogeneity, and measures the entropy of the cluster distribution within each class. H-measure 
therefore also captures the extent to which classes are split between clusters. 

We will use the following notation throughout this section: 

• SCP^ and SC„P^ refer to the proposed projection pursuits for minimising X2{L{9)) and A 2 (inorm(^)) 
respectively. 

• LMSC refers to the proposed approach of finding large margin separators, based on repeatedly 
minimising \ 2 {L{d)) for a shrinking sequence of scaling parameters. 

• Subscripts “o” and “c” indicate whether we use an orthogonal projection (w > 0) or a correlated 
one (w < 0), respectively. 

• SC and SC„ refer to spectral clustering based on the eigen-decompositions of L and 
respectively. 
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• Subscripts “PCA^^ and ^PCA” indicate principal and independent component analysis projec¬ 
tions respectively. For example, SCnPCA refers to spectral clustering using the normalised 
Laplacian applied to the data projected into a principal component oriented subspace. 


• DRSC abbreviates dimensionality reduction for spectral clustering, proposed by Niu et al. (20111. 
This existing approach applies only to the normalised Laplacian. 


iSVRi and iSVRc denote the iterative support vector regression approach for maximum margin 
clustering (Zhang et al. 2009), using the linear and Gaussian kernels respectively. 


7.1 Details of Implementation 


To extend our approach to datasets containing multiple clusters, we simply recursively partition 
(subsets of) the dataset until the desired number of clusters is obtained. We prefer this approach to 
the alternative of directly seeking a projection which yields a full K way partition of the dataset, i.e., 
by minimising the sum of the first K eigenvalues of the Laplacians, as it is not always clear that all 
the clusters present in the data can be exposed using a single projection of fixed dimension. At each 
iteration in this recursive bi-partitioning we split the largest remaining cluster. 

The scaling parameter and initialisation are set for each bi-partition, given values determined by 
the subset of the data being partitioned. For the fixed scaling parameter approaches, SCP^ and 
SC„P^, we set a = , where I is the dimension of the projection, N is the size of the (subset 

of the) data and is the largest eigenvalue of the covariance matrix. The value ^/Xd captures the 
scale of the data, while '/I accounts for the fact that distances scale roughly with the square root 
of the dimension. The denominator term, is borrowed from kernel density methods and we 

have found it to work reasonably well for our applications as well. For the large margin approach, 
LMSC, a is initialised at and decreased by a factor of two with each minimisation of 

X2{L{0)), until convergence of the projection matrix. The initialisation of V{9) is via the first I 
principal components. For the orthogonal projections we use a two dimensional projection, as this 
is the lowest dimensional space which can expose non-linear separation between clusters. For the 
correlated projections we provide a three dimensional initialisation, and it was found that in most 
cases a high quality univariate projection could be determined from this. 

For the LMSC approach, because the values within the Laplacian matrix approach zero, the 
optimisation becomes less robust, and we found that the correlated approach did not always lead to 
large margin separation. We believe this is as a result of the term controlling the correlation becoming 
too dominant relative to the decreasing eigenvalue unless very careful tuning of w is performed. We 
therefore consider a univariate projection instead of the multivariate correlated approach in this case. 

Recall that the parameter /3 controls the size of the constraint set A. It is clear that smaller 
values of /? will tend to lead to more balanced partitions, but a precise interpretation of the resulting 
cluster sizes is unavailable. At best bounds on the cluster sizes can be computed using Chebyshev’s 
inequality. Rather than relying on these bounds, which may be loose and difficult to interpret in 
multivariate projections, we recommend applying the proposed method for a range of values of /? and 
selecting the solution corresponding to the largest value of /? which induces a specified balance in the 
partition. We define this balance to be satisfied if the smallest cluster size is at least half the average, 
i.e., N/2K. In this way the effect of the constraint is limited while still producing the desired result. 
We initialise with a large value of /3 and decrease by 0.5 until the balance is met. If this balance is 
not met for /3 = 0.5, then the corresponding “unbalanced” result is returned anyway. 

The parameter 6 is set to minjO.Ol, cr^} and a to 0.1. We have found these two parameters not 
to significantly influence the performance of the method. It is important to note, however, that that 
parameter a controls the shape of the similarity function, and as a result there is an interplay between 
this value and the value of cr. For substantially larger values of a we expect a smaller value of a to 
be more appropriate. 

For competing approaches based on spectral clustering we do the following. Whenever the number 


of data exceeds 1000 we use the approximation method of Yan et al. (2009). Following Niu et al. 
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(2011), we set the reduced dimension to K — 1, where K is the number of clusters. We compute 
clustering results for all values of cr in {0.1,0.2, 0.5,1, 2, 5,10, 20, 50, 100, 200} as well as for the local 
scaling approach of Zelnik-Manor and Perona (2004), and report the highest performance in each 
case. For DRSC we also considered the parameter setting used for our method, and to implement 
the local scalings of Zelnik-Manor and Perona (2004) these were recomputed with each iteration in 
the corresponding projected subspace. We also provided DRSC with a warm start via PCA as this 
improved performance over a random initialisation, and offers a more fair comparison. Because of 
this extensive search over scaling parameters we expect competing methods to achieve very high 
performance whenever the corresponding dimension reduction is capable of exposing the clusters in 
the data well. 

For the iSVR maximum margin clustering method, we set the balancing parameter equal to 0.3 as 
suggested by Zhang et al. (2009) when the cluster sizes are not balanced. We argue that the balance 
of the clusters will not be known in practice, and the unbalanced setting led to superior performance 
compared with the balanced setting in the examples considered. The iSVR approach also generates 
only a bi-partition, and to generate multiple clusters we apply the same recursive approach as in our 
method. 


7.2 Clustering Results 


The following benchmark datasets were used for comparison: 


• Optical recognition of handwritten digits (Opt. Digits)]^ 5620 8x8 compressed images of 
handwritten digits in {0, ...,9}, resulting in 64 dimensions with 10 classes. 

• Pen based recognition of handwritten digits (Pen Digits).^ 10992 observations, each correspond¬ 
ing to a stylus trajectory {x, y coordinates) from a handwritten digit in (0,..., 9}, i.e., 10 classes. 
The trajectories are sampled at 8 time points, resulting in 16 dimensions. 

• Satellite.^ 6435 multispectral values from 3x3 pixel squares from satellite images, which results 
in 36 dimensions. There are 6 classes corresponding to different land types. 

• Breast cancer Wisconsin (Br. Cancer).^ 699 observations with 9 attributes relating to tumour 
masses. There are 2 classes corresponding to benign and malignant masses. 

• Congressional votes (Voters).^ 435 sets of 16 binary decisions on US congressional ballots. The 
2 classes correspond to political party membership. 

• Dermatology.^ 366 observations corresponding to dermatology patients, each containing 34 di¬ 
mensions derived from clinical and histopathological features. There are 6 classes corresponding 
to different skin diseases. 


• Yeast cell cycle analysis (Yeast)698 yeast genes across 72 conditions (dimensions). There are 
5 classes corresponding to different genotypes. 

• Synthetic control chart (Chart).^ 600 simulated time series of length 60 displaying one of 6 
fundamental characteristics, leading to 6 classes. 


Multiple feature digits (M.F. Digits).^ 2000 handwritted digits in (0, ...,9} taken from Dutch 
utility maps. Following Niu et al. (2011) we use only the 216 profile correlation features. 


Statlog image segmentation (Image Seg.).^ 2310 observations containing 19 features derived from 
3x3 pixel squares from 7 outdoor images. Each image constitutes a class. 


^https://archive.ics.uci.edu/ml/datasets.html 
^ http://genome-www.Stanford.edu/cellcycle/ 
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Table 1: Purity results for spectral clustering using the standard Laplacian, L. Average performance 
from 30 runs on each dataset, with standard deviation as subscript. The highest average performance 
in each case is highlighted in bold. 



SCP2 

SCP2 

SCpcA 

SC,cA 

SC 

Opt. Digits 

0.81o.o5 

0.83o,o6 

0.33o.o5 

0.20o,o2 

O.llo.oo 

Pen Digits 

0.78o.oi 

0.76o.oi 

0.64o.o2 

0.53o,o3 

0.62o.o3 

Satellite 

0-75o.oo 

0.73o.o3 

0.63o.oi 

0.72o,oi 

0.62o.o2 

Br. Cancer 

0-97’o.oo 

0.97o,oo 

0-97o,oo 

0-97o.oo 

O. 960.00 

Voters 

0.84o,oo 

0.83o.oo 

0 . 860,00 

0 . 860.00 

O. 780.00 

Dermatology 

0.94o.oo 

O. 9 O 0.00 

O. 9 I 0.00 

0.89o,oo 

O. 560.00 

Yeast 

0.75o.oo 

0.74o.oo 

0.67o.oo 

0.62o,oo 

O. 6 O 0.00 

Chart 

0.84o.oo 

0.83o.oo 

0.72o.o6 

0.65o,o7 

0.56o.o5 

M.F. Digits 

0.83o.o2 

0.79o.o2 

0.53o.o3 

0.34o,o3 

O. 3 I 0.04 

Image Seg. 

0.62o,o3 

0 . 680,03 

0.53o.o2 

0.46o,o2 

0.56o.o3 


Before applying the clustering algorithms, data were rescaled so that every feature had unit vari¬ 
ance. This is a standard approach to handle situations where different features are captured on dif¬ 
ferent scales and an appropriate rescaling is not obviously apparent from the context. For consistency 
we used this same preprocessing approach for all datasets. 

7.2.1 Spectral Clustering Using the Standard Laplacian 

Tables [2 and report the purity and U-measure respectively for the proposed method and spectral 
clustering using the standard Laplacian applied to the original data, as well as their projection into 
PCA and ICA oriented subspaces. The tables report the average and standard deviation (as subscript) 
from 30 repetitions. The highest average performance for each dataset is highlighted in bold. Both 
the orthogonal and correlated projection approaches achieve substantially higher performance than 
other methods in the majority of cases. There are few cases where they are not competitive with the 
best performing of the competing approaches, while there are mulitple examples where the proposed 
methods strongly outperform all others. The two versions of the proposed method are closely compa¬ 
rable with one another on average, with the correlated approach offering a slightly better worst case 
comparison. This, however, does not appear highly significant beyond sampling variation both within 
each dataset and with respect to the collection of datasets used for comparison. What is evident is that 
the added flexibility offered by multivariate projections does not result in a substantial improvement 
over univariate projections, which induce linear cluster boundaries. 


7.2.2 Spectral Clustering Using the Normalised Laplacian 

Tables 1^ and 1^ report the purity and U-measure respectively for the proposed approach based on 
minimising A 2 (Lnorm(^))) tbe dimensionality reduction for spectral clustering algorithm (Niu et al. 
2011[) and spectral clustering based on the normalised Laplacian applied to the original data and their 


PCA and ICA projections. Again the tables show the average and standard deviation from 30 runs 
of each method, with the highest average performance on each dataset highlighted in bold. 

The proposed approach using both correlated and orthogonal projections is again competitive with 
all other methods in almost all cases considered. In addition both versions of the proposed approach 
substantially outperform the other methods in multiple examples. Unlike in the case of the standard 
Laplacian, here there is evidence that the orthogonal projection achieves better clustering results in 
general, outperforming the correlated approach in the majority of examples. 
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Table 2: y-measure results for spectral clustering using the standard Laplacian, L. Average per¬ 
formance from 30 runs on each dataset, with standard deviation as subscript. The highest average 
performance in each case is highlighted in bold. 



SCP2 

SCP2 

SC PC A 

SC/c.4 

SC 


Opt. Digits 

0.77o,o3 

0.78o,o4 

0.40o,o5 

0.21o,o3 

O.Olo 

00 

Pen Digits 

O.TSo.oi 

0.74o.o2 

0 . 660.01 

0.54o,o3 

0.64o 

02 

Satellite 

O. 6 O 0.00 

0.59o.o3 

O.5O0.00 

O. 6 O 0.01 

O.5O0 

01 

Br. Cancer 

0.79o,oo 

O.SOq.oo 

O.780.00 

0.78o,oo 

0.77o 

00 

Voters 

0-42o.oo 

O.380.00 

0.41o,oo 

0.41o.oo 

O.3O0 

00 

Dermatology 

0.89o.oo 

0.83o.oo 

0 . 860.00 

0.82o,oo 

0.58o 

00 

Yeast 

0.54o,oo 

0.54o,oo 

O.5I0.00 

0.40o,oo 

O.4I0 

00 

Chart 

0.77o,oo 

0.77o.oo 

0.81o,o2 

0.73o,o4 

O.7O0 

02 

M.F. Digits 

0.76o.oi 

0.73o.o2 

0.56o.o2 

0.36o,o5 

0.38o 

05 

Image Seg. 

0.62o,oi 

0.65o.o2 

0.53o.oi 

0.46o,oi 

0.58o 

02 


Table 3: Purity results for spectral clustering using the normalised Laplacian, Lnorm- Average per¬ 
formance from 30 runs on each dataset, with standard deviation as subscript. The highest average 
performance in each case is highlighted in bold. 



sc„p2 

SC„P2 

DRSC 

SCnPCA 

SC„7CA 

sc„ 

Opt. Digits 

O-SIq.os 

0.74o,o6 

0-80o,o3 

0.66o,o3 

0-64o.oi 

0 . 660.02 

Pen Digits 

0.78o,oo 

0.74o,o2 

0.69o,o4 

0-76o,o3 

0.74o.oi 

0.77o.o4 

Satellite 

0.75o.oo 

0.74o,o3 

0.73o,oi 

0.76o,oi 

0.73o.o2 

0.74o.oi 

Br. Cancer 

0.97o,oo 

0.97o.oo 

0.96o,oo 

0.97o,oo 

0.97o,oo 

0.97o,oo 

Voters 

0.85o.oo 

0.84o,oo 

0 . 860.00 

0 . 860,00 

0 . 860,00 

0.85o.oo 

Dermatology 

0 . 860.00 

0.9Io,oo 

0-87o,oo 

0.92o.o2 

0-91o.oo 

0.95o,oo 

Yeast 

0.76o,oo 

0.70o,oo 

0.62o,oo 

0-71o.oo 

0.69o.oi 

O. 6 O 0.00 

Chart 

0.87o,oo 

0.85o.oo 

0.75o,oo 

0.7Io.o6 

O. 8 O 0.07 

0.75o.oo 

M.F. Digits 

0.84o,oi 

0.79o,oi 

0.77o,o3 

0.77o,o3 

0.79o.oi 

0.77o.o2 

Image Seg. 

0 . 660.01 

O. 7 O 0.01 

0.65o.o4 

O. 6 I 0.03 

0.55o.o2 

0.62o.o2 


Table 4: P-measure results for spectral clustering using the normalised Laplacian, Lnorm- Average 
performance from 30 runs on each dataset, with standard deviation as subscript. The highest average 
performance in each case is highlighted in bold. 



SC„P2 

SC„p2 

DRSC 

SCnPCA 

SCnICA 

sc„ 

Opt. Digits 

0-77o,o3 

0.72o,o4 

0.75o,o2 

0.62o,o2 

0 . 600.01 

0.62o.oi 

Pen Digits 

0.76o,oo 

0.74o,oi 

0 . 660,02 

0.72o,oi 

0 . 680.01 

0.73o.o2 

Satellite 

O. 6 I 0.01 

0.60o,o3 

0.57o,oi 

0.62o,oi 

0 . 600.05 

O. 6 O 0.01 

Br. Cancer 

0.79o.oo 

O. 8 I 0.00 

0.76o,oo 

0.81o,oo 

0.81o,oo 

0.79o.oo 

Voters 

0-42o,oo 

0.4Io,oo 

0.42o.oo 

0.4Io.oo 

0.41o.oo 

0.42o,oo 

Dermatology 

0.85o.oo 

0 . 860,00 

0.80o,oo 

0.86o.o2 

0.83o.oo 

0.91o,oo 

Yeast 

0.57o,oo 

0.45o.oo 

0.4Io.oo 

0.54o,oo 

0.53o.oo 

0.45o.oo 

Chart 

0.81o,oo 

0.80o,oo 

0.75o,oo 

0.81o,o2 

O. 8 O 0.05 

0.73o.oo 

M.F. Digits 

0.76o,oi 

0.74o,o2 

0.69o,o2 

O. 7 I 0.02 

0.75o.oi 

O. 7 O 0.02 

Image Seg. 

0.65o.oi 

0 . 680.01 

0.62o,o4 

O. 6 O 0.02 

0.46o.o2 

O. 6 O 0.02 
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Table 5: Purity results for large margin clustering. Average performance from 30 runs on each dataset, 
with standard deviation as subscript. The highest average performance in each case is highlighted in 
bold. 



LMSCo 

LMSC 

ISVRl 

iSVRc 

Opt. Digits 

0.74o.o5 

0.69o,o5 

0.76o,oo 

O. 6 I 0.01 

Pen Digits 

O. 8 O 0.02 

O. 7 I 0.04 

0.78o.oo 

O. 780.00 

Satellite 

0.75o.oi 

0-72o,o3 

0 - 68 o.oo 

0 - 68 o.oo 

Br. Cancer 

O. 960.00 

0.97o.oo 

0.90o,oo 

0.95o.oo 

Voters 

0.84o.oo 

0.84o.oo 

0-81o.oo 

0.84o,oo 

Dermatology 

0 . 860.00 

0 . 860.00 

0-80o.oo 

O. 8 O 0.00 

Yeast 

0-75o.oo 

0.75o.oo 

0.75o,oo 

O. 7 O 0.00 

Chart 

0-89o.oo 

0.83o,oo 

0-72o.oo 

0-72o.oo 

M.F. Digits 

0-82o.o5 

0.74o,o4 

0-67o.oo 

O-bOo.oi 

Image Seg. 

O. 6 O 0.04 

0.65o.o3 

0.64o.oo 

0 . 660.01 


7.2.3 Large Margin Clustering 


It is important to note that the method described in Section]^ does not provide a close approximation 
as cr —>■ 0^. For the datasets containing more than 1000 data we use the microcluster approach for all 
values of tr and therefore only guarantee a large separation between the microclusters. It is arguable 
that this is a preferable objective as the maximum margin is not robust in the presence of noise, 
and it is not clear that it converges in the general setting (Ben-David et al. 20091. Microclusters 


have the potential to absorb some of the noise in the data, and in the event that they are of roughly 
equal density, maximising the margin over the microcluster centers has a similar effect to that of 
minimising the empirical density in a neighbourhood of the corresponding hyperplane separator. This 


is reminiscent of the soft-margin approach which does enjoy strong convergence properties (Ben-David 


et al., 2009). In addition, since the optimisation is reinitialised for each value of cr, we are able to 


recompute the microclusters by performing the coarse clustering on the projected data with each 
iteration. This tends to lead to the margins in the microclusters being more closely related to the 
margins in the full dataset along the optimal projections. 

Tables and report the average and standard deviation of the proposed LMSC as well as the 


iterative support vector regression approach (Zhang et al., 20091 using both a linear and Gaussian 
kernel for comparison. Both versions of LMSC, using an orthogonal two dimensional and a one 
dimensional projection, outperform both versions of the iterative support vector regression in the 
majority of cases, with substantially higher performance in multiple examples. There is strong evidence 
that the two dimensional LMSCq obtains better quality clustering results than the one dimensional 
alternative, showing sunstantially higher performance in the vast majority of cases considered. 


7.3 Summarising Clustering Performance 

Thus far we have compared different approaches for standard and normalised spectral clustering and 
for large margin clustering separately. These separate comparisons are important to understand the 
benefits of the proposed methods, however when considering the clustering problem abstractly it is 
necessary to compare all methods jointly. It is already clear that no method is uniformly superior to all 
others, since even within the separate comparisons no method outperformed the rest in every example. 
We find it important to reiterate the fact that for competing methods based on spectral clustering an 
extensive search over scaling parameters was performed and the best performance reported, whereas 
for our method only a simple data driven heuristic was used in every example. Such a search is not 
possible in practice since the true lables will not be known, and hence the results reported for these 
methods likely overestimate their true expected performance in practice. What was evident, however. 
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Table 6: ^-measure results for large margin clustering. Average performance from 30 runs on each 
dataset, with standard deviation as subscript. The highest average performance in each case is high¬ 
lighted in bold. 



LMSCo 

LMSC 

iSVRi 

iSVRc 

Opt. Digits 

0-70o.o3 

0.62o.o4 

0-72o.oo 

0.57o.oo 

Pen Digits 

0.76o.o2 

0 . 660.03 

0.72o,oi 

0.73o.oo 

Satellite 

O. 6 I 0.01 

0.57o.o3 

0.55o,oo 

0.55o.oo 

Br. Cancer 

0-76o.oo 

0.78o,oo 

0.55o,oo 

0-72o.oo 

Voters 

0.43o,oo 

O. 380.00 

0.34o,oo 

0-42o.oo 

Dermatology 

0 . 860 , 00 

0.85o.oo 

0.77o.oo 

0.74o.oi 

Yeast 

0.56o.oo 

0.58o,oo 

0.55o,oi 

0.53o.oo 

Chart 

0.85o.oo 

O. 780.00 

0 . 660,00 

0-72o.oo 

M.F. Digits 

0-75o,o3 

0-69o.o3 

0.60o,oo 

0-62o.oi 

Image Seg. 

0.60o.o3 

0.63o,o3 

0.63o.oo 

O. 6 O 0.01 


is that the local scaling approach of Zelnik-Manor and Perona (20041 is very effective and yielded the 


highest performance in roughly half the cases considered. 

It is clearly apparent from the performance of the various methods that the clustering problem 
differs vastly in difficulty across the different datasets considered. To combine the results from the 
different datasets we standardise them as follows. For each dataset D we compute for each method 
the relative deviation from the average performance of all methods when applied to D. That is, for 
each method Mi we compute the relative purity. 


Rel.Purity(Mi, D) = 


Purity (M„ D) - ^Melhods Purity (M,, D) 


1 

^Methods ^j — 1 


^Methods 


Purity (Mj, I?) 


(31) 


and similarly for P-measure. We can then compare the distributions of the relative performance 
measures from all datasets and for all methods. It is clear from Table [l] that the competing methods 
SC, SC PC A and SCjca are not competitive with other methods in general, due to their vastly inferior 
performance on multiple datasets. Moreover, their performance is sufficiently low to obscure the com¬ 
parisons between others. These three methods are therefore omitted from this comparison. Figures 
and show boxplots of the relative performance measures for all other methods. The additional red 
dots indicate the mean relative performance measures for each method, and methods are ordered in 
decreasing order of their means. In the case of purity, all of the proposed methods outperform every 
method used for comparison, and except for the univariate large margin method, LMSC, the difference 
between the proposed methods and the methods used for comparison is substantial. In the case of 
P-measure, the same is true except that in this case LMSC is outperformed on average by spectral 
clustering using the normalised Laplacian applied to the PCA projected data. Notice that the most 
relevant comparison for LMSC is with iSYRi because of their similar objectives. In terms of both 
purity and P-measure, LMSC significantly outperformed the existing large margin clustering method. 

Among the methods used for comparison, it is evident that spectral clustering is capable of out¬ 
performing existing large margin clustering methods, provided an appropriate scaling parameter can 
be determined. Of those spectral clustering variants, PCA projections showed the best overall perfor¬ 


mance. While the DRSC method (Niu et al. 2011) in some cases showed a substantial improvement 


over the simpler dimension reduction of PCA, it did not yield consistently higher performance on the 
datasets considered. 

Overall it is apparent that the proposed approach for projection pursuit based on spectral con¬ 
nectivity is highly competitive with existing dimension reduction methods. Moreover, a simple data 
driven heuristic allowed us to select the important scaling parameter automotically without tuning it 


for each dataset, as is recommended for the DRSC method (Niu et al., 20111. Among the variants of 
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the proposed approaches, it is evident that while the flexibility of the multivariate projections offered 
higher performance on average than the corresponding univariate projections, it is only in the case of 
the large margin separation methods that this improvement is significant beyond the variation from 
the collection of datasets used for comparison. 

Figure 5: Box plots of relative purity with additional red dots to indicate means. Methods are ordered 
with decreasing mean value. 



7.4 Sensitivity Analysis 

In this subsection we investigate the sensitivity of the proposed approach to the setting of the impor¬ 
tant scaling parameter, a. In addition we consider the effect on performance of the number of micro¬ 
clusters used in approximating the optimisation surface. For the former we consider the breast cancer, 
voters, dermatology, yeast and chart datasets as these exhibited very low variability in performance 
and offer more interpretable comparisons. Figures and show plots of the purity and F-measure 
values for a taking values in {O.lcroi O-Scq, O.Sctq, o-q, 2cro, 5 (To, IOcq}, where 0 ^ = is the 

value used in the experiments above. There is some variability in the performance for different values, 
with no clear pattern inditcating that a higher or lower value than the one used is better in general. 
Importantly there are very few occurences of substantially poorer performance than that obtained 
with our simply chosen heuristic, and also it is clear that in the majority of cases performance could 
be improved from what is reported above if an appropriate tuning of a is possible. 

To investigate the effect of microclusters on clustering accuracy we simulated datasets from Gaus¬ 
sian mixtures containing 5 components (clusters) in 50 dimensions. This allows us to generate datasets 
of any desired size. For these experiments 30 sets of parameters for the Gaussian mixtures were gener¬ 
ated randomly. In the first case a single dataset of size 1000 was simulated from each set of parameters, 
and clustering solutions obtained for a number of microclusters, iF, ranging from 100 to 1000, the 
final value therefore applying no approximation. Figure shows the median and interquartile range 
of both performance measures for 10 values of K. It is evident that aside from K = 100, performance 
is similar for all other values, and so using a small value, say K = 200, should be sufficient to obtain 
a good approximation of the underlying optimisation surface. 
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Figure 6: Box plots of relative T^-measure with additional red dots to indicate means. Methods are 
ordered with decreasing mean value. 
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In the second, we fix the number of microclusters, K = 200, and for each set of parameters simulate 
datasets with between 1000 and 10 000 observations. In the most extreme case, therefore, the number 
of microclusters is only 2% of the total number of data. Figurej^shows the corresponding performance 
plots, again containing the medians and interquartile ranges. Even for datasets of size 10 000, the 
coarse approximation of the dataset through 200 microclusters is sufficient to obtain a high quality 
projection using the proposed approach. 


8 Conclusions 

We proposed a projection pursuit method for hnding the optimal subspace in which to perform a binary 
partition of unlabelled data. The proposed method optimises the separability of the projected data, 
as measured by spectral graph theory, by minimising the second smallest eigenvalue of the graph 
Laplacians. The Lipschitz continuity and differentiability properties of this projection index with 
respect to the projection matrix were established, which enabled us to apply a generalised gradient 
descent method to find locally optimal solutions. Compared with existing dimension reduction for 
spectral clustering, we derive expressions for the gradient of the overall objective and so find solutions 
within a single generalised gradient descent scheme, with guaranteed convergence to a local optimum. 
Our experiments suggest that the proposed method substantially outperforms spectral clustering 
applied to the original data as well as existing dimensionality reduction methods for spectral clustering. 

A connection to maximal margin hyperplanes was established, showing that in the univariate case, 
as the scaling parameter of the similarity function is reduced towards zero, the binary partition of 
the projected data maximises the linear separability between the two clusters. Implementing our 
method for a shrinking sequence of scaling parameters thus allows us to find large margin separators 
practically. We found that this approach outperforms state of the art methods for maximum margin 
clustering on a large collection of datasets. 

The computational cost of the proposed projection pursuit method per iteration is 0{dN^), where 
N is the number of observations, and d is the dimensionality, which can become prohibitive for 
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Purity Purity 


Figure 7: Sensitivity analysis for varying a. Standard Laplacian. The x-axis contains the multiplica¬ 
tion factor applied to the default scaling parameter used in the experiments. 



o 



> 


C\J 

o 

p 

^ ^^^- 1 -^- 1 -^ 

0.1 0.2 0.5 1 2 5 10 

sigma scaie 


(a) SCP2 




0.1 0.2 0.5 1 2 5 10 

sigma scaie 


Br. Cancer ( 


(b) SCP2 

), Voters (—A—), Dermatology (— x— ), Yeast (—o—), Chart (— o— ) 


31 




















Purity Purity 


Figure 8: Sensitivity analysis for varying tr. Normalised Laplacian. The at-axis contains the multipli¬ 
cation factor applied to the default scaling parameter used in the experiments. 
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Performance Performance 


Figure 9: Sensitivity analysis for varying number of microclusters, K. Plots show median and in¬ 
terquartile ranges of performance measures from 30 datasets simulated from 50 dimensional Gaussian 
mixtures with 5 clusters and 1000 observations. 
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Performance Performance 


Figure 10: Sensitivity analysis for fixed number of microclusters, K = 200, and varying number of 
data. Plots show median and interquartile ranges of performance measures from datasets simulated 
from 50 dimensional Gaussian mixtures with 5 clusters and between 1000 and 10 000 observations. 
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large datasets. To ameliorate this an approximation method using microclusters, with provable error 
bounds is proposed. Our sensitivity analysis, based on clustering performance, indicates that even 
for relatively few microclusters, the approximation of the optimisation surface is adequate for finding 
high quality subspaces for clustering. 


A. Derivatives 


In the general case we may consider a set of K microclusters with centers ci,...,ck and counts 
ni, ...jUk- The derivations we provide in this appendix are valid for ni = l Wi € {1,..., K}, and so 
apply to the exact formulation of the problem as well. Let 0 G & and let P be the repeated projected 
cluster centers, P = {pi, ,pi, ...pk,Pk} = {V{6Yci, , V(0)^ci,..., V{6)^ck}, where each V{9)^Ci is 
repeated rii times. In Section]^ we expressed DgX via the chain rule decomposition DpXDyPDgv. 
The compression of P to the size K non-repeated projected set, = {pi, ...,pk}, requires a slight 
restructuring, as described in Section 


We begin with the standard Laplacian, and define N{6) and B{0) as in Lemma That is, 
N{0) is the diagonal matrix with i-th diagonal element equal to = 

^ninjs{P^, i,j). The derivative of the second eigenvalue of the Laplacian of P relies on the corre¬ 
sponding eigenvector, u. However, this vector is not explicitly available as we only solve the K x K 
eigen-problem of N{0) — B{0). Let u'" be the second eigenvector of N[9) — B{9). As in the proof 
of Lemma if i,j are such that the Pth element of P corresponds to the j-th microcluster, then 
uf = y/fi^Ui. The derivative of X2{N{9) — B{0)) with respect to the Pth column of 9, and thus equiv¬ 
alently of the second eigenvalue of the Laplacian of P, is therefore given by 


1 

2 







ds {P^, j, k) 
dPu 







ds jP^J, k ) 
dPiK 

CK f Dg^V,, 


(32) 


where (ci ... ck) is the matrix with j-th column c^, P is treated as al x N matrix with f-th column pi, 
and Dg. Vt is given in Eq. 0. Now, the use of the constraint set Ag and the associated transformation 
makes a further decomposition convenient. Let T = {ti ,..., tk} = {Ta^ (pi), ■ ■ •, Taa (j>k)}- We provide 
expressions for the specific constraint sets used, i.e., Ag - /3ag^,pg. + f3ag^], where pg^ = 

J2f=i i® approximated by ^f=i ~ k-SiY- For ease of exposition we assume 

that each pg. is equal to zero, noting that no generality is lost through this simplification since the 
value of the eigenvalue of the Laplacian is location independent. The data can therefore be centered 
prior to projection pursuit and the following formulation employed. We can then express the first 


component of (32) as Dt^XD pcTi, where 



TljTlf- 



dk 


llL-*dl' 


'^j'^k " 


dT, 


iK 


(33) 
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and DpcTi is the K x K matrix with 


{DpcTi)j^k = 

{DpcTi)jj = 


S(l-S)PnkPik/Nas. 


(-/3<Te^-P.,+(5(l-5))i/^)S > 

0rikPik 

Naa^ ’ 

2PnkPik S{l-S)f3nkPik/Na-a. 

Naa^ (P,,-/3f7e.+(5(l-5))i/i)^ 

6il-S){l+l3n^P,JNaa,) 


{-I3aa.-Pij+(S{1-S)y/6)S, 


1 


j Pjj 


ATere. ’ 

2/3n,Pi,- (5(l-(5)(l-/3n3Pij/Aro-ed 

AT ere. + (P,^-_/3cre,+(5(l-5))i/«)« 


Py < -13(70. 

-13(70, < Pij < I3a0, 

Pi] > 13(70^ 

Pij < -(3a0, 

-P<70, < Pij < 13(70^ 
Pij > 13(70,. 


(34) 


(35) 


In the above we have used the lower case tj to denote the j-th element of the transformed projected 
dataset, where the upper case T^- denotes the ij-th element of the I x N matrix with j-th column 
equal to tj. The benefit of this further decomposition lies in the fact that the majority of terms in 
the sums in (33) are zero. In fact, 


1 

2 





dT,m 


E 

jiim 




PjPm 



where for the function given in Eq. (30) we have. 



dT„ 


T-—T- 



aa 


+ 1 


exp 



(36) 


(37) 


For the normalised Laplacian, the reduced K x K eigenproblem has precisely the same form as the 
original N x N problem, with the only difference being the introduction of the factors njU],. In 
particular, the second eigenvalue of the normalised Laplacian of P is equal to the second eigenvalue of 
the Laplacian of the graph of P'^ with similarities given by njnks{P'^, j, k). With the derivation in 
Section]^ we can see that the corresponding derivat ive is as for the standard Laplacian, except that 
the coefficients {uf /— u'^/y/nk)‘^njnk in Eq. (36) are replaced with {u^/yJdj—u'^/\/dk)'^ — 
X{{uf)‘^/dj + (u^)^/d/c), where A is the second eigenvalue of the normalised Laplacian of P^, is 

the corresponding eigenvector and dj is the degree of the j-th element of P^. 
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